A Hierarchical Clustering and Validity Index for Mixed Data

Authors

  • Rui Yang
  • Dianne Cook
  • Heike Hofmann
  • John Jackman
Abstract

This study develops novel approaches for partitioning mixed data, that is, datasets containing both numeric and nominal attributes, into natural groups. Such data arise in many diverse applications. Our approach addresses two important issues in clustering mixed datasets. The first is how to find the optimal number of clusters, which matters because this number is unknown in many applications. The second is how to group the objects “naturally” according to a suitable similarity measure. These problems are especially difficult for mixed datasets because they require unifying the two different representation schemes for numeric and nominal data. To address the issue of constructing clusters, that is, of naturally grouping objects, we compare the performance of four distances capable of handling mixed datasets when incorporated into a classical agglomerative hierarchical clustering approach. Based on these results, we conclude that the so-called co-occurrence distance performs well as a dissimilarity measure: it obtains good clustering results with reasonable computation, thus balancing effectiveness and efficiency. The second important contribution of this research is an entropy-based validity index for validating the sequence of partitions generated by hierarchical clustering with the co-occurrence distance. A cluster validity index called the BK index is modified for mixed data and used in conjunction with the proposed clustering algorithm. This index is compared with three well-known indices, namely the Calinski-Harabasz index (CH), the Dunn index (DU), and the Silhouette index (SI). The results show that the modified BK index outperforms the other three indices in its ability to identify the true number of clusters. Finally, the study identifies the limitation of hierarchical clustering with the co-occurrence distance and provides some remedies that improve not only the clustering accuracy but especially the ability to correctly identify the best number of classes in mixed datasets.
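The pipeline the abstract describes (a mixed-data distance, agglomerative hierarchical clustering, and a validity index scanned over the resulting partitions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not define the co-occurrence distance or the modified BK index, so the sketch substitutes a Gower-style distance and the Silhouette index (one of the comparison indices named above). Average linkage, the toy data, and all function names are likewise assumptions made for illustration.

```python
from itertools import combinations

def attribute_ranges(data):
    """Range of each numeric column; None marks a nominal column."""
    ranges = []
    for col in zip(*data):
        if all(isinstance(v, (int, float)) for v in col):
            ranges.append(max(col) - min(col))
        else:
            ranges.append(None)
    return ranges

def gower_distance(a, b, ranges):
    """Gower-style stand-in for the paper's co-occurrence distance:
    range-scaled absolute difference for numeric attributes,
    0/1 mismatch for nominal attributes, averaged over attributes."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        elif r > 0:
            total += abs(x - y) / r
    return total / len(a)

def distance_matrix(data):
    ranges = attribute_ranges(data)
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = gower_distance(data[i], data[j], ranges)
    return d

def average_linkage(d, ca, cb):
    """Mean pairwise distance between two clusters of object indices."""
    return sum(d[i][j] for i in ca for j in cb) / (len(ca) * len(cb))

def agglomerative(d, k):
    """Merge the two closest clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_linkage(d, clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index b stays valid
        del clusters[b]
    return clusters

def silhouette(d, clusters):
    """Mean silhouette width over all objects; singletons score 0.
    Requires at least two clusters."""
    scores = []
    for ci, members in enumerate(clusters):
        for i in members:
            if len(members) == 1:
                scores.append(0.0)
                continue
            a = sum(d[i][j] for j in members if j != i) / (len(members) - 1)
            b = min(sum(d[i][j] for j in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy data: two numeric and two nominal attributes per object.
data = [
    (1.0, 1.2, "red", "a"),
    (1.1, 0.9, "red", "a"),
    (0.9, 1.0, "red", "b"),
    (5.0, 5.1, "blue", "c"),
    (5.2, 4.9, "blue", "c"),
    (4.8, 5.0, "blue", "d"),
]

d = distance_matrix(data)
# Score the partition at each candidate number of clusters and keep the best.
scores = {k: silhouette(d, agglomerative(d, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, agglomerative(d, best_k))
```

On this toy dataset the scan selects two clusters, matching the two visibly separated groups. Swapping in a different distance or validity index only requires replacing `gower_distance` or `silhouette`; the scan over the sequence of partitions stays the same.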


Similar resources

Determination of the Best Hierarchical Clustering Method for Regional Analysis of Base Flow Index in Kerman Province Catchments

The lack of complete coverage of hydrological data forces hydrologists to use homogenization methods in regional analysis. In this research, in order to choose the best hierarchical clustering method for regional analysis, base flow and the related index were extracted from daily stream flow data using two-parameter recursive digital filters at 43 hydrometric stations in Kerman province. Ph...


Improved Automatic Clustering Using a Multi-Objective Evolutionary Algorithm With New Validity measure and application to Credit Scoring

In data mining, clustering is an important technique for separating and grouping unlabeled data. In this paper, an attempt has been made to improve and optimize the application of heuristic clustering methods such as the genetic algorithm, PSO, the artificial bee colony algorithm, harmony search, and differential evolution on the unlabeled data of an Iranian bank...


Using Clustering and Factor Analysis in Cross Section Analysis Based on Economic-Environment Factors

Homogeneity of groups is important in studies that use cross-section and multi-level data. Most studies in economics, especially panel data analyses, need some kind of homogeneity to ensure the validity of results. This paper presents the methods known as clustering and homogenization of groups in cross-section studies based on enviro-economic components. For this, a sample of 92 countries which...


A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One of these errors is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There must be only one of them in a data source, but for some reasons like aggregation of ...


A tree-based measure for hierarchical data in mixed databases

The structure of the data in a mixed database can be a barrier when clustering that database into meaningful groups. A hierarchically structured database necessitates efficient distance measures and clustering algorithms to locate similarities between data objects. Therefore, existing literature proposes hierarchical distance measures to measure the similarities between the records in hierarchi...




Publication date: 2015